Abstract

In this paper, we introduce a modified collaborative filtering (MCF) algorithm, which has
remarkably higher accuracy than the standard collaborative filtering. In the MCF, the
user-user correlations are obtained by a diffusion process instead of the cosine similarity
index. Furthermore, by considering the second-order correlations, we design an effective
algorithm that depresses the influence of mainstream preferences. Simulation results show
that the algorithmic accuracy, measured by the average ranking score, is further improved
by 20.45% and 33.25% in the optimal cases of MovieLens and Netflix data, respectively. More
importantly, the optimal value $\lambda_{\mathrm{opt}}$ depends approximately monotonically on the
sparsity of the training set. Given a real system, we could estimate the optimal parameter
according to the data sparsity, which makes this algorithm easy to apply. In addition, two
significant criteria of algorithmic performance, diversity and popularity, are also taken
into account. Numerical results show that as the sparsity increases, the algorithm that
considers the second-order correlation can outperform the MCF simultaneously in all three
criteria.

1 Introduction



With the expansion of Internet services, people are becoming increasingly dependent on
the Internet and face an information overload. Consequently, how to efficiently help people
find the information that they truly need is a challenging task nowadays [1]. As an
effective tool to address this problem, the recommender system has attracted increasing
attention and become an essential issue in Internet applications such as e-commerce systems
and digital library systems [2]. Motivated by its practical significance to e-commerce and
society, the design of efficient recommendation algorithms has become a joint focus of
communities ranging from engineering science to mathematics and physics. Various kinds of
algorithms have been proposed, such as correlation-based methods [3,4], content-based
methods [5,6,7,8], spectral analysis [9,10], iteratively self-consistent refinement [11],
principal component analysis [12], network-based methods [13,14,15,16], and so on. For a
review of current progress, see Refs. [17,18] and the references therein.

One of the most successful recommendation algorithms, called collaborative filtering (CF),
has been developed and extensively investigated over the past decade [3,4,19]. When
predicting the potential interests of a given user, this approach first identifies a set of
similar users from the past records and then makes a prediction based on the weighted
combination of those similar users' opinions. Despite its wide applications, collaborative
filtering suffers from several major limitations, including system scalability and accuracy
[20]. Recently, some physical dynamics, including mass diffusion (MD) [14,15,21] and heat
conduction (HC) [13], have found applications in personalized recommendations. Based on
MD and HC, several effective network-based recommendation algorithms have been proposed
[13,14,15,16]. These algorithms have been demonstrated to be of both high accuracy and low
computational complexity. However, the algorithmic accuracy and computational complexity
may be very sensitive to the statistics of data sets. For example, the algorithm presented
in Ref. [15] runs much faster than the standard CF if the number of users is much larger
than that of objects, while when the number of objects is huge, the advantage of this
algorithm vanishes because its complexity is mainly determined by the number of objects
(see Ref. [15] for details). Since the CF algorithm has been extensively applied in real
e-commerce systems [4,22], it is meaningful to find ways to increase the algorithmic
accuracy of CF. We therefore present a modified collaborative filtering (MCF) method, in
which the user correlation is defined based on the diffusion process. Recently, Liu et al. [23]
studied the effect of the user and object degree correlation on CF and found that the
algorithmic accuracy could be remarkably improved by adjusting the user and object degree
correlation. In this paper, we argue that the high-order correlations should be taken into
account to depress the influence of mainstream preferences, and that the accuracy can be
improved in this way. The correlation between two users is, in principle, an integration of
many underlying similar tastes. For two arbitrary users, the very specific yet common
tastes should contribute more to the similarity measure than the mainstream tastes.
Figure 1 illustrates how to find the specific tastes by eliminating the mainstream
preference. For users A and C, the commonly selected objects 1 and 2 could reflect their
tastes, where 1 denotes the mainstream preference shared by all of A, B and C, and 2 is the
specific taste of A and C. Both 1 and 2 contribute to the correlation between A and C.
Since 1 is the mainstream preference, it also contributes to the correlations between A and
B, as well as B and C. Tracking the path A → B → C, the mainstream preference 1 could be
identified by considering the second-order correlation between A and C. Statistically
speaking, two users sharing many mainstream preferences should have a high second-order
correlation; therefore, we can depress the influence of mainstream preferences by taking
into account the second-order correlation. The numerical results show that the algorithm
involving high-order correlations is much more accurate and provides more diverse
recommendations.

Fig. 1. Illustration of the user correlation network. The users A, B and C are correlated
because they have collected some common objects, where object 1 has been collected by
all three users, while object 2 is only collected by users A and C.



2 Problem description and performance metrics 

Denote the object set as $O = \{o_1, o_2, \dots, o_m\}$ and the user set as
$U = \{u_1, u_2, \dots, u_n\}$. A recommender system can be fully described by an adjacency
matrix $A = [a_{ij}] \in \mathbb{R}^{m \times n}$, where $a_{ij} = 1$ if $o_i$ is collected by
$u_j$, and $a_{ij} = 0$ otherwise. For a given user, a recommendation algorithm generates an
ordered list of all the objects he/she has not collected before.
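
As a minimal sketch of this representation, the following Python snippet builds the 0/1 matrix from a list of (object, user) pairs and reads off the degrees. The function name `build_adjacency`, the 0-based indices and the toy data are illustrative assumptions, not part of the original description.

```python
import numpy as np

def build_adjacency(pairs, m, n):
    """Return the m x n 0/1 matrix A with A[i, j] = 1 if object i is collected by user j."""
    A = np.zeros((m, n), dtype=int)
    for obj, user in pairs:
        A[obj, user] = 1
    return A

# Hypothetical toy data: 3 objects and 3 users, 0-based indices.
pairs = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 2), (2, 1)]
A = build_adjacency(pairs, m=3, n=3)

user_degree = A.sum(axis=0)    # number of objects collected by each user
object_degree = A.sum(axis=1)  # number of users who collected each object
```

Rows index objects and columns index users, so column sums give user degrees and row sums give object degrees.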

To test the algorithmic accuracy, we divide the data set into two parts: one is the training
set, used as known information for prediction, and the other is the probe set, whose
information is not allowed to be used. Many metrics have been proposed to judge the
algorithmic accuracy, including precision [17], recall [17], F-measure [3], average ranking
score [15], and so on. Since the average ranking score does not depend on the length of the
recommendation list, we adopt it in this paper. Indeed, a recommendation algorithm should
provide each user with an ordered list of all his/her uncollected objects. For an arbitrary
user $u_i$, if the entry $u_i$-$o_j$ is in the probe set (according to the training set, $o_j$
is an uncollected object for $u_i$), we measure the position of $o_j$ in the ordered list. For
example, if there are $L_i = 100$ uncollected objects for $u_i$, and $o_j$ is the 10th from the
top, we say the position of $o_j$ is 10/100, denoted by $r_{ij} = 0.1$. Since the probe entries
are actually collected by users, a good algorithm is expected to rank them high, leading
to small $r_{ij}$. Therefore, the mean value of the position $r_{ij}$, $\langle r \rangle$ (called
the average ranking score [15]), averaged over all the entries in the probe, can be used to
evaluate the algorithmic accuracy: the smaller the ranking score, the higher the
algorithmic accuracy, and vice versa. For a null model with randomly generated
recommendations, $\langle r \rangle = 0.5$.
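
The average ranking score can be computed along the following lines. This is only a sketch under stated assumptions: the function name, the objects-by-users array layout and the tie-breaking rule (ties keep index order) are choices made here for illustration and are not specified in the text.

```python
import numpy as np

def average_ranking_score(scores, train, probe):
    """Mean relative position of probe entries among each user's uncollected objects.

    scores: (m, n) array, scores[i, j] is the predicted score of object i for user j.
    train:  (m, n) 0/1 array built from the training set only.
    probe:  list of (object, user) pairs; each must be uncollected in `train`.
    """
    positions = []
    for user in sorted({u for _, u in probe}):
        uncollected = np.flatnonzero(train[:, user] == 0)
        # Highest score first; a stable sort keeps ties in index order (an assumption).
        order = uncollected[np.argsort(-scores[uncollected, user], kind="stable")]
        rank = {int(obj): pos + 1 for pos, obj in enumerate(order)}
        positions += [rank[obj] / len(order) for obj, u in probe if u == user]
    return float(np.mean(positions))

# The example from the text: 100 uncollected objects, probe object 10th from the top.
train = np.zeros((101, 1), dtype=int)
train[100, 0] = 1                                            # one object already collected
scores = np.arange(101, 0, -1, dtype=float).reshape(-1, 1)   # object 0 scores highest
r = average_ranking_score(scores, train, probe=[(9, 0)])     # object 9 is ranked 10th
```

Here `r` corresponds to the position 10/100 of the worked example above.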

Besides accuracy, the average degree of all recommended objects, $\langle k \rangle$, and the
mean value of the Hamming distance, $S$, are taken into account to measure the algorithmic
popularity and diversity [16]. A smaller average degree, corresponding to less popular
objects, is preferred, since lower-degree objects are hard for users to find by
themselves. In addition, a personalized recommendation algorithm should present different
recommendations to different users according to their tastes and habits. The diversity can
be quantified by the average Hamming distance, $S = \langle H_{ij} \rangle$, where
$H_{ij} = 1 - Q_{ij}/L$, $L$ is the length of the recommendation list, and $Q_{ij}$ is the
number of objects shared by the recommendation lists of $u_i$ and $u_j$. A higher $S$
indicates more diverse and thus more personalized recommendations.
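
Both quantities are simple to evaluate once the top-$L$ lists are available. The sketch below assumes that the Hamming distance is averaged over all distinct user pairs and that the degree is averaged over every entry of every list; the function names and the toy lists are hypothetical.

```python
import numpy as np
from itertools import combinations

def mean_hamming_distance(rec_lists):
    """Diversity: average of 1 - Q/L over all distinct pairs of users.

    rec_lists: one top-L list of object ids per user, all of the same length L.
    """
    L = len(rec_lists[0])
    distances = [1 - len(set(a) & set(b)) / L for a, b in combinations(rec_lists, 2)]
    return float(np.mean(distances))

def mean_recommended_degree(rec_lists, object_degree):
    """Popularity: average degree of all recommended objects over all lists."""
    return float(np.mean([object_degree[o] for rec in rec_lists for o in rec]))

# Hypothetical toy lists with L = 2 for three users.
rec_lists = [[0, 1], [0, 2], [3, 4]]
object_degree = [10, 4, 2, 1, 1]

diversity = mean_hamming_distance(rec_lists)
popularity = mean_recommended_degree(rec_lists, object_degree)
```

Two identical lists give a distance of 0 and two disjoint lists give 1, so the diversity always lies between 0 and 1.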



3 Modified collaborative filtering algorithm based on diffusion process 

In the standard CF, the correlation between $u_i$ and $u_j$ can be evaluated directly by
the well-known cosine similarity index

$$ s^{c}_{ij} = \frac{1}{\sqrt{k(u_i)\,k(u_j)}} \sum_{l=1}^{m} a_{li} a_{lj}, \qquad (1) $$

where $k(u_i) = \sum_{l=1}^{m} a_{li}$ is the degree of user $u_i$. Inspired by the diffusion
process presented by Zhou et al. [15], the user correlation network can be obtained by
projecting the user-object bipartite network. How to determine the edge weight is the key
issue in this process. We assume a certain amount of resource (e.g., recommendation power)
is associated with each user, and the weight $s_{ij}$ represents the proportion of the
resource that $u_j$ would like to distribute to $u_i$. This process can be implemented by
applying the network-based resource-allocation process [24] on the user-object bipartite
network: each user distributes his/her initial resource equally to all the objects he/she
has collected, and then each object sends back what it has received, in equal shares, to
all the users who collected it. The weight $s_{ij}$ (the fraction of initial resource that
$u_j$ eventually gives to $u_i$) can be expressed as

$$ s_{ij} = \frac{1}{k(u_j)} \sum_{l=1}^{m} \frac{a_{li} a_{lj}}{k(o_l)}, \qquad (2) $$

where $k(o_l) = \sum_{i=1}^{n} a_{li}$ denotes the degree of object $o_l$. For the user-object
pair $(u_i, o_j)$, if $u_i$ has not yet collected $o_j$ (i.e., $a_{ji} = 0$), the predicted
score $v_{ij}$ is given as

$$ v_{ij} = \frac{\sum_{l=1,\, l \neq i}^{n} s_{li} a_{jl}}{\sum_{l=1,\, l \neq i}^{n} s_{li}}. \qquad (3) $$

Fig. 2. The optimal $\lambda_{\mathrm{opt}}$ and the improvement (IP) vs. the sparsity of the
training sets. All the data points are averaged over ten independent runs with different
data-set divisions. The results corresponding to Netflix data are marked.



Based on the definitions of $s_{ij}$ and $v_{ij}$, given a target user $u_i$, the MCF
algorithm proceeds as follows:

(i) Calculate the user correlation matrix $\{s_{ij}\}$ based on the diffusion process, as
shown in Eq. (2);

(ii) For each user $u_i$, calculate the predicted scores for his/her uncollected objects
based on Eq. (3);

(iii) Sort the uncollected objects in descending order of the predicted scores; the
objects at the top are recommended.



The standard CF and the MCF follow the same procedure; their only difference is that they
adopt different measures of user-user correlation (i.e., $s^{c}_{ij}$ for the standard CF
and $s_{ij}$ for MCF).
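
The three steps can be sketched in a few lines of NumPy. The code below is an illustration under explicit assumptions, not the authors' implementation: the matrix is stored as objects by users; the diffusion weight splits each user's resource equally among the collected objects and each object's share equally among its users; and the predicted score is taken to be a weighted average of the other users' collections, with weight $s_{li}$ for user $u_l$ and the target user excluded. All function names are hypothetical, and every user and object is assumed to have degree at least one.

```python
import numpy as np

def cosine_similarity(A):
    """Standard CF: user-user cosine similarity; A is objects x users (0/1)."""
    k_u = A.sum(axis=0).astype(float)
    return (A.T @ A) / np.sqrt(np.outer(k_u, k_u))

def diffusion_similarity(A):
    """MCF: S[i, j] is the fraction of user j's initial resource that ends up at user i."""
    k_u = A.sum(axis=0).astype(float)   # user degrees
    k_o = A.sum(axis=1).astype(float)   # object degrees
    return (A.T @ (A / k_o[:, None])) / k_u[None, :]

def predict_scores(A, S):
    """V[i, j]: score of object j for user i, weighted by S[l, i] over users l != i."""
    W = S.copy()
    np.fill_diagonal(W, 0.0)            # exclude the target user itself
    return (W.T @ A.T) / W.sum(axis=0)[:, None]

def recommend(A, S, user, top=5):
    """Return up to `top` uncollected objects of `user`, best score first."""
    v = predict_scores(A, S)[user].copy()
    v[A[:, user] == 1] = -np.inf        # collected objects are never recommended
    order = np.argsort(-v, kind="stable")
    n_uncollected = int((A[:, user] == 0).sum())
    return order[:min(top, n_uncollected)].tolist()

# Hypothetical toy system: 3 objects (rows) and 3 users (columns).
A = np.array([[1, 1, 1],
              [1, 0, 1],
              [0, 1, 0]])
S = diffusion_similarity(A)
top_for_user0 = recommend(A, S, user=0)
```

Passing `cosine_similarity(A)` instead of `S` to `recommend` gives the standard CF under the same prediction rule. Each column of the diffusion-based matrix sums to one, since every user redistributes exactly the resource it started with.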
Fig. 3. Average degree of recommended objects, $\langle k \rangle$, vs. $\lambda$ when $p = 0.9$.
Squares, circles and triangles represent lengths $L = 10$, 20 and 50, respectively. The
black point (•) corresponds to the average degree obtained by the standard CF with
$L = 20$. All the data points are averaged over ten independent runs with different
data-set divisions.

4 Numerical results of MCF 



We use two benchmark data sets. One is MovieLens¹, which consists of 1682 movies (objects)
and 943 users. The other is Netflix², which consists of 3000 movies and 3000 users (we use
a random sample of the whole Netflix data set). The users rate movies with discrete ratings
from one to five. Here we apply a coarse-graining method [15,16]: a movie is considered
to be collected by a user only if the given rating is larger than 2. In this way, the
MovieLens data has 85250 edges, and the Netflix data has 567456 edges. The data sets are
randomly divided into two parts: the training set contains a fraction $p$ of the data, and
the remaining fraction $1 - p$ constitutes the probe.
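
A minimal sketch of this preprocessing is given below. The input format (a list of user, object, rating triples), the function names and the fixed random seed are assumptions made for illustration; the actual file formats of the two data sets are not described here.

```python
import numpy as np

def coarse_grain(ratings, threshold=2):
    """Keep a (user, object) edge only if the rating is larger than `threshold`."""
    return [(u, o) for u, o, r in ratings if r > threshold]

def split_edges(edges, p=0.9, seed=0):
    """Randomly assign a fraction p of the edges to the training set, the rest to the probe."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(edges))
    cut = int(round(p * len(edges)))
    train = [edges[i] for i in idx[:cut]]
    probe = [edges[i] for i in idx[cut:]]
    return train, probe

# Hypothetical ratings: (user, object, rating on a 1-5 scale).
ratings = [(0, 0, 5), (0, 1, 2), (1, 0, 3), (1, 1, 1), (2, 2, 4)]
edges = coarse_grain(ratings)            # ratings of 1 or 2 are discarded
train, probe = split_edges(edges, p=0.9)
```

Repeating the split with different seeds gives the independent data-set divisions over which results are averaged.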

Implementing the standard CF and the MCF with $p = 0.9$, the average ranking scores on
MovieLens and Netflix data are improved from 0.1168 to 0.1038 and from 0.2323 to 0.2151,
respectively. Clearly, in terms of algorithmic accuracy, the MCF with the simple
diffusion-based similarity outperforms the standard CF. The corresponding average object
degree and diversity are also improved (see Fig. 3 and Fig. 4 below).



¹ http://www.grouplens.org
² http://www.netflixprize.com
Fig. 4. $S$ vs. $\lambda$ when $p = 0.9$. Squares, circles and triangles represent the lengths
$L = 10$, 20 and 50, respectively. The black point (•) corresponds to the diversity
obtained by the standard CF with $L = 20$. All the data points are averaged over ten
independent runs with different data-set divisions.

5 Improved algorithm 



To investigate the effect of the second-order user correlation on the performance of MCF,
we use a linear form in which the user similarity matrix is written as

$$ H = S + \lambda S^{2}, \qquad (4) $$

where $H$ is the newly defined correlation matrix, $S = \{s_{ij}\}$ is the first-order
correlation defined in Eq. (2), and $\lambda$ is a tunable parameter. As discussed before, we
expect that the algorithmic accuracy can be improved at some negative $\lambda$.
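
For a concrete picture, the sketch below evaluates this linear combination on the three-user network of Fig. 1. The function name, the variable `lam` for the tunable parameter and the objects-by-users layout are assumptions made here; the first-order matrix is built by letting each user split its resource equally among its objects and each object split what it receives equally among its users.

```python
import numpy as np

def second_order_similarity(S, lam):
    """Return S + lam * S^2 for a user-user correlation matrix S."""
    return S + lam * (S @ S)

# Toy network of Fig. 1: rows are objects 1 and 2, columns are users A, B and C.
A = np.array([[1, 1, 1],
              [1, 0, 1]], dtype=float)
k_u = A.sum(axis=0)                               # user degrees
k_o = A.sum(axis=1)                               # object degrees
S = (A.T @ (A / k_o[:, None])) / k_u[None, :]     # first-order diffusion-based correlation

lam = -0.82                                       # a negative value, as discussed above
H = second_order_similarity(S, lam)
```

Since every column of `S` sums to one, every column of `H` sums to `1 + lam`, which is a convenient sanity check; setting `lam = 0` recovers the first-order matrix.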

When $p = 0.9$, the algorithmic accuracy curves of MovieLens and Netflix have clear minima
around $\lambda = -0.82$ and $\lambda = -0.84$, respectively, which strongly supports the above
discussion. Compared with the routine case ($\lambda = 0$), the average ranking scores can be
further reduced to 0.0826 (improved by 20.45%) and 0.1436 (improved by 33.25%) at the
optimal values. This is a substantial improvement for recommendation algorithms. Since the
data sparsity can be tuned by changing $p$, we investigate the effect of the sparsity on
the two data sets respectively, and find that, although the two data sets are different,
the optimal $\lambda_{\mathrm{opt}}$ is strongly correlated with the sparsity in a uniform way for
both MovieLens and Netflix. Figure 2 shows that when the sparsity increases,
$\lambda_{\mathrm{opt}}$ decreases and the improvement of the average ranking score increases.
These results can serve as a guideline for selecting the optimal $\lambda$ for different data
sets. Figure 3 reports the average degree
Table 1

Algorithmic performance for MovieLens data when $p = 0.9$. The precision, diversity and
popularity correspond to $L = 50$. NBI is an abbreviation of the network-based
recommendation algorithm proposed in Ref. [15]. Heter-NBI is an abbreviation of NBI with
heterogeneous initial resource distribution, proposed in Ref. [16]. CB-CF is an
abbreviation of the correlation-based collaborative filtering method proposed in Ref. [23].
IMCF is an abbreviation of the improved MCF algorithm presented in this paper. The
parameters in Heter-NBI, CB-CF and IMCF are set as the ones corresponding to the lowest
ranking scores (for Heter-NBI [16], $\beta_{\mathrm{opt}} = -0.80$; for CB-CF [23],
$\lambda_{\mathrm{opt}} = -0.96$; for IMCF, $\lambda_{\mathrm{opt}} = -0.82$). Each number presented in
this table is obtained by averaging over ten runs, each of which has an independently
random division of training set and probe.

Algorithms    Ranking score <r>    Diversity S    Average degree <k>
CRM           0.1390               0.398          259
CF            0.1168               0.549          246
NBI           0.1060               0.617          233
Heter-NBI     0.1010               0.682          220
CB-CF         0.0998               0.692          218
IMCF          0.0877               0.826          175

of all recommended objects as a function of $\lambda$. One can see from Fig. 3 that when
$p = 0.9$ the average object degree is positively correlated with $\lambda$; thus, depressing
the influence of mainstream interests gives more opportunity to the less popular objects,
which could bring more information to the users than the popular ones. When the list
length $L$ equals 20, at the optimal point $\lambda_{\mathrm{opt}} = -0.82$, the average degree is
reduced by 29.3% compared with the standard CF. When $p = 0.9$, Fig. 4 exhibits a negative
correlation between $S$ and $\lambda$, indicating that considering the second-order
correlations makes the recommendation lists more diverse. When $L = 20$, the diversity $S$
is increased from 0.592 (corresponding to the standard CF) to 0.880 (corresponding to the
case $\lambda = -0.82$ in the improved algorithm). Figure 3 and Figure 4 show how the
parameter $\lambda$ affects the average object degree $\langle k \rangle$ and the diversity $S$,
respectively. Clearly, a smaller $\lambda$ leads to lower popularity and higher diversity, and
thus the present algorithm has an advantage over the standard CF in recommending novel
objects with diverse topics to users. Generally speaking, the popular objects must have
some attributes fitting the tastes of the masses. The standard CF may repeatedly count
those attributes and assign more power to the popular objects, which increases the average
object degree and reduces the diversity. The present algorithm with negative $\lambda$ can to
some extent eliminate the redundant correlations and give higher chances to less popular
objects and to objects with diverse topics different from the mainstream [25].

6 Conclusions



In this paper, a modified collaborative filtering algorithm is presented to improve the
algorithmic performance. The numerical results indicate that the use of diffusion-based
correlation could enhance the algorithmic accuracy. Furthermore, by considering the
second-order correlations, $S^{2}$, we presented an effective algorithm that has remarkably
higher accuracy. Indeed, when $p = 0.9$ the simulation results show that the algorithmic
accuracy can be further improved by 20.45% and 33.25% on MovieLens and Netflix data,
respectively. Interestingly, we found that even for different data sets, the optimal value
of $\lambda$ exhibits a uniform tendency versus sparsity. Therefore, if we know the sparsity
of the training set, the corresponding optimal $\lambda_{\mathrm{opt}}$ can be approximately
determined. In addition, when the sparsity gets less than 1%, the improved algorithm is no
longer effective, while as the sparsity increases, the improvement of the presented
algorithm is enlarged.

Ignoring the degree-degree correlation in user-object entries, the algorithmic complexity
of MCF is $O(m \langle k_u \rangle \langle k_o \rangle + mn \langle k_o \rangle)$, where
$\langle k_u \rangle$ and $\langle k_o \rangle$ denote the average degrees of users and objects,
respectively. The first term accounts for the calculation of the user correlations, and the
second term accounts for that of the predictions. It approximates to
$O(mn \langle k_o \rangle)$ for $n \gg \langle k_u \rangle$. Clearly, the computational complexity
of MCF is much less than that of the standard CF, especially for systems consisting of a
huge number of objects. In the improved algorithm, in order to calculate the second-order
correlations, the diffusion process must flow from the users to the objects twice;
therefore, the algorithmic complexity of the improved algorithm is
$O(n \langle k_u \rangle^{2} \langle k_o \rangle^{2} + mn \langle k_o \rangle)$. Since the order of
magnitude of the number of objects $m$ is always much larger than those of
$\langle k_u \rangle$ and $\langle k_o \rangle$, the improved algorithm is comparably fast to the
standard CF.

Besides the algorithmic accuracy, two significant criteria of algorithmic performance, the
average degree of recommended objects and the diversity, are taken into account. A good
recommendation algorithm should help users uncover the hidden (even dark) information,
corresponding to objects with very low degrees. Therefore, the average degree is a
meaningful measure for a recommendation algorithm. In addition, since a personalized
recommendation system should provide different recommendation lists according to each
user's tastes and habits [2], diversity plays a crucial role in quantifying the
personalization [26,27]. The numerical results show that the present algorithm outperforms
the standard CF in all three criteria.

How to automatically find relevant information for diverse users is a long-standing
challenge in modern information science. The presented algorithm could also be used to
find relevant reviewers for scientific papers or funding applications [28,29], and for
link prediction in social and biological networks [30,31]. We believe the current work can
enlighten readers in this promising direction.